Critique: “Filtering duplicate reads from 454 pyrosequencing”

Authors

  • Susanne Balzer
  • Ketil Malde
  • Markus A. Grohme
  • Inge Jonassen
Abstract

The paper describes a novel approach for filtering duplicate reads from 454 pyrosequencing data. The problem is motivated by the need to reduce sequencing errors and artificially duplicated reads in applications such as de novo whole-genome sequencing and metagenomics. Existing solutions are mostly based on nucleotide sequences, while the raw flowgram values, which contain additional information, go unused. The authors present a new software tool, JATAC, which accepts 454 flowgrams as input and can be used for accurate duplicate filtering. The approach is based on read clustering, performed by calculating all pairwise distances between reads. The distance calculation uses the probability that two homopolymers have the same length given the corresponding observed flowgram values. The method was benchmarked on three different bacterial datasets and showed better results than existing solutions.

The advantages of this approach are clear: using raw flowgram data gives a more accurate estimate of read similarity and, as a result, more precise duplicate filtering.

While the approach looks attractive, the paper has some weak points. First, hierarchical clustering is performed with a threshold constant that was chosen empirically. It is unclear why the same constant should be used for clusters of different size and different read diversity. More accurate approaches that use probabilistic models for read clustering already exist (see [1]). Second, the distance calculation relies on further unmotivated empirical constants, which correspond to the maximum homopolymer size involved in the flowgram comparison. Third, no benchmarking of running time or memory usage was performed.

To sum up, although the read clustering is not very accurate, the presented software seems to be more accurate than existing solutions. It could be used for general filtering of 454 sequencing data. The authors do not recommend using the present version of JATAC for IonTorrent data.
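
The threshold-based clustering discussed in the critique can be illustrated with a minimal sketch. This is not JATAC's implementation: the distance below is a plain mean absolute difference of flow values, standing in for the paper's probabilistic homopolymer-length distance, and the names `flow_distance`, `cluster_reads`, `filter_duplicates`, and the cap `MAX_HOMOPOLYMER` are hypothetical. It only shows how one fixed, empirically chosen threshold decides which reads end up in the same cluster.

```python
from itertools import combinations

# Hypothetical cap on flow values, standing in for the "maximum homopolymer
# size" constant mentioned in the critique (the value 5 is an assumption).
MAX_HOMOPOLYMER = 5


def flow_distance(a, b, cap=MAX_HOMOPOLYMER):
    """Mean absolute difference of capped flow values over the shared prefix.

    Placeholder distance only; the paper uses a probabilistic measure.
    """
    n = min(len(a), len(b))
    if n == 0:
        return float("inf")
    return sum(abs(min(x, cap) - min(y, cap)) for x, y in zip(a, b)) / n


def cluster_reads(flowgrams, threshold):
    """Single-linkage clustering over all pairwise distances (union-find)."""
    parent = list(range(len(flowgrams)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(flowgrams)), 2):
        if flow_distance(flowgrams[i], flowgrams[j]) <= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)

    clusters = {}
    for i in range(len(flowgrams)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def filter_duplicates(flowgrams, threshold):
    """Keep one representative per cluster: the read with the most flows."""
    return sorted(
        max(cluster, key=lambda i: len(flowgrams[i]))
        for cluster in cluster_reads(flowgrams, threshold)
    )
```

Because every pair of reads is compared, the cost grows quadratically with the number of reads, which is one reason the missing running-time benchmark matters. The single `threshold` argument is exactly the kind of global constant the critique questions: it is applied identically to small, tight clusters and to large, diverse ones.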


Similar resources

Filtering duplicate reads from 454 pyrosequencing data

MOTIVATION Throughout the recent years, 454 pyrosequencing has emerged as an efficient alternative to traditional Sanger sequencing and is widely used in both de novo whole-genome sequencing and metagenomics. Especially the latter application is extremely sensitive to sequencing errors and artificially duplicated reads. Both are common in 454 pyrosequencing and can create a strong bias in the e...


454 Pyrosequencing to Describe Microbial Eukaryotic Community Composition, Diversity and Relative Abundance: A Test for Marine Haptophytes

Next generation sequencing of ribosomal DNA is increasingly used to assess the diversity and structure of microbial communities. Here we test the ability of 454 pyrosequencing to detect the number of species present, and assess the relative abundance in terms of cell numbers and biomass of protists in the phylum Haptophyta. We used a mock community consisting of equal number of cells of 11 hapt...


Correction of sequence-dependent ambiguous bases (Ns) from the 454 pyrosequencing system

Pyrosequencing of the 16S ribosomal RNA gene (16S) has become one of the most popular methods to assess microbial diversity. Pyrosequencing reads containing ambiguous bases (Ns) are generally discarded based on the assumptions of their non-sequence-dependent formation and high error rates. However, taxonomic composition differed by removal of reads with Ns. We determined whether Ns from pyroseq...


Primer and platform effects on 16S rRNA tag sequencing

Sequencing of 16S rRNA gene tags is a popular method for profiling and comparing microbial communities. The protocols and methods used, however, vary considerably with regard to amplification primers, sequencing primers, sequencing technologies; as well as quality filtering and clustering. How results are affected by these choices, and whether data produced with different protocols can be meani...


Lessons learned from microsatellite development for nonmodel organisms using 454 pyrosequencing.

Microsatellites, also known as simple sequence repeats (SSRs), are among the most commonly used marker types in evolutionary and ecological studies. Next Generation Sequencing techniques such as 454 pyrosequencing allow the rapid development of microsatellite markers in nonmodel organisms. 454 pyrosequencing is a straightforward approach to develop a high number of microsatellite markers. There...




Publication date: 2013